How to build your own AutoML library in Python from scratch

AutoML libraries and services have already entered the world of machine learning. They are very useful tools for a Data Scientist, but sometimes they must be adapted to fit the needs of the business context a Data Scientist works in. That’s why you might need to build your own AutoML library.

Let’s see how to do it in Python.

What must an AutoML library do?

An AutoML library is any piece of software that automates some of the hardest (and most boring) parts of a machine learning pipeline. Compared to doing all these tasks manually, using AutoML speeds up the machine learning process and reduces the risk of mistakes.

An AutoML library must automatically perform these actions:

  • Blank filling
  • Encoding of categorical variables
  • Scaling of numerical variables
  • Feature selection
  • Model selection
  • Hyperparameter tuning

The idea is that an AutoML library tries all the combinations of these parameters, measuring the average model performance with k-fold cross-validation and selecting the best set of values. So, it’s an optimization procedure over a grid of settings.

The approach

In this article, I’m going to cover only a classification pipeline using this grid of settings:

  • Blank filling of numerical variables: mean or median value
  • Blank filling of categorical variables: most frequent value
  • Scaling: Normalization, Standardization or Robust scaling
  • Filter-based feature selection using ANOVA
  • Models used: logistic regression, KNN, random forest, Gradient Boosting, Binary decision tree, SVM with linear kernel

Each model comes with its own hyperparameters, which must be optimized together with the pre-processing parameters. So, the idea is that all these parameters become the hyperparameters of one large machine learning pipeline that includes the pre-processing and the models. Even the model itself becomes a hyperparameter of this pipeline. In this way, we have translated our problem into a hyperparameter optimization problem, which we know how to solve.

The pipeline hyperparameter space is very large, so we are going to use a random search to find the best set of values of such hyperparameters.

Our object will take a Pandas dataframe as input for training purposes, and its “fit” method will perform the required optimization to find the best model and the best settings for the pre-processing phase.

Let’s now see the code.

The code

We are going to create an object called MyAutoMLClassifier and we’re going to train and test it on the breast cancer dataset. You can find the whole code on my GitHub repository.

Let’s first import some libraries:
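
A plausible set of imports for the sketches in this article could be the following (the exact list in the original repository may differ slightly):

```python
import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, train_test_split
```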

Now, we can start defining the MyAutoMLClassifier class. Its constructor will accept a scoring function that will be used in the k-fold CV and the number of iterations of the random search. For this example, their default values will be “balanced_accuracy” and 50.
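
A minimal sketch of the constructor could look like this:

```python
class MyAutoMLClassifier:
    def __init__(self, scoring_function='balanced_accuracy', n_iter=50):
        # Any scoring string accepted by scikit-learn can be used here.
        self.scoring_function = scoring_function
        self.n_iter = n_iter
```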

Now we can start building the “fit” method, which is the most important one.

First, we have to detect the distinct values of the categorical variables in order to apply one-hot encoding.
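
Inside the “fit” method, the detection could look like this (treating object and category columns as categorical is my assumption):

```python
    def fit(self, X, y):
        # Collect the distinct values of each categorical column, so that
        # the one-hot encoder knows every level in advance.
        cat_subset = X.select_dtypes(include=['object', 'category'])
        categorical_values = [list(X[col].dropna().unique()) for col in cat_subset.columns]
```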

Now we have to define a pre-processing pipeline for the categorical variables. This pipeline will clean the blanks using the most frequent value and will one-hot encode them.
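
For example:

```python
        cat_pipeline = Pipeline([
            ('cleaner', SimpleImputer(strategy='most_frequent')),
            # Passing the detected categories avoids errors when a CV fold
            # happens to miss one of the levels.
            ('encoder', OneHotEncoder(categories=categorical_values, handle_unknown='ignore'))
        ])
```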

At the same time, we are going to define a pipeline for the numerical variables, which will be cleaned according to a strategy defined later and scaled with a scaler that we’ll choose in the random search part.
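
A sketch of this pipeline, with placeholders that the search will overwrite:

```python
        num_pipeline = Pipeline([
            ('cleaner', SimpleImputer()),   # strategy (mean/median) set by the search
            ('scaler', StandardScaler())    # placeholder, swapped by the search
        ])
```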

Everything is finally combined in a ColumnTransformer, which will perform the whole pre-processing phase.
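
Something like:

```python
        preprocessor = ColumnTransformer([
            ('categorical', cat_pipeline, cat_subset.columns),
            ('numerical', num_pipeline, X.select_dtypes(include='number').columns)
        ])
```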

Finally, we have to define the ML pipeline, which is made of the pre-processing phase, the feature selection and the model itself. We can set the model to LogisticRegression for the moment; it will be changed later by the random search.
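
For example:

```python
        model_pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('feature_selector', SelectKBest(f_classif, k=10)),  # k overridden by the search
            ('estimator', LogisticRegression())                  # placeholder model
        ])
```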

Then, we can calculate the total number of features (we’ll need it for the feature selection part) and create an empty list that will contain the optimization grid, following the syntax required by RandomizedSearchCV.
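
```python
        # Total number of features after pre-processing; it bounds the k
        # of SelectKBest in the grid entries below.
        total_features = preprocessor.fit_transform(X).shape[1]
        optimization_grid = []
```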

Now we can start adding models to our optimization grid.

Let’s start with logistic regression:
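
An entry for logistic regression could look like this (the range for the regularization parameter C is an illustrative addition of mine):

```python
        optimization_grid.append({
            'preprocessor__numerical__scaler': [RobustScaler(), StandardScaler(), MinMaxScaler()],
            'preprocessor__numerical__cleaner__strategy': ['mean', 'median'],
            'feature_selector__k': list(np.arange(1, total_features, 5)) + ['all'],
            'estimator': [LogisticRegression(max_iter=1000)],  # higher max_iter helps convergence
            'estimator__C': np.logspace(-4, 2, 20)             # illustrative range
        })
```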

As we can see, we are creating an entry that lets the search change the scaler among RobustScaler, StandardScaler and MinMaxScaler, change the cleaning strategy between mean and median, and select the number of features from 1 to the total number of features in steps of 5. Finally, the model itself is set. The random search will check random combinations of this grid, searching for the one that maximizes the performance metric in cross-validation.

We can now add other models with their own pre-processing needs and hyperparameters. For example, trees don’t require any scaling, but SVMs do.
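
Here is how a few of those entries could look; the hyperparameter ranges are illustrative choices:

```python
        # Decision tree: scale-invariant, so the scaler step is switched off.
        optimization_grid.append({
            'preprocessor__numerical__scaler': ['passthrough'],
            'preprocessor__numerical__cleaner__strategy': ['mean', 'median'],
            'feature_selector__k': list(np.arange(1, total_features, 5)) + ['all'],
            'estimator': [DecisionTreeClassifier()],
            'estimator__criterion': ['gini', 'entropy']
        })

        # Gradient boosting: also tree-based, so no scaling needed.
        optimization_grid.append({
            'preprocessor__numerical__scaler': ['passthrough'],
            'preprocessor__numerical__cleaner__strategy': ['mean', 'median'],
            'feature_selector__k': list(np.arange(1, total_features, 5)) + ['all'],
            'estimator': [GradientBoostingClassifier()],
            'estimator__n_estimators': np.arange(5, 500, 10),
            'estimator__learning_rate': np.linspace(0.1, 0.9, 20)
        })

        # SVM with linear kernel: needs scaling; probability=True
        # enables predict_proba later.
        optimization_grid.append({
            'preprocessor__numerical__scaler': [RobustScaler(), StandardScaler(), MinMaxScaler()],
            'preprocessor__numerical__cleaner__strategy': ['mean', 'median'],
            'feature_selector__k': list(np.arange(1, total_features, 5)) + ['all'],
            'estimator': [SVC(kernel='linear', probability=True)],
            'estimator__C': np.logspace(-3, 2, 10)
        })
```

Entries for KNN and the random forest would follow the same pattern.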

We can add as many models as we want, optimizing their hyperparameters in the same grid.

So, we are searching for the best combination of cleaning strategy, scaling procedure, set of features, model and hyperparameter values, all in the same search procedure. This is the core of any AutoML library, and it can be extended as much as you want.

We now have a complete optimization grid, so we can finally run the random search to find the best pipeline parameters and save the results in some properties of our object. The random search will apply 5-fold cross-validation with the scoring function and the number of iterations we chose in the class constructor.
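
A sketch of this final step:

```python
        search = RandomizedSearchCV(
            model_pipeline,
            optimization_grid,
            n_iter=self.n_iter,
            scoring=self.scoring_function,
            cv=5,
            n_jobs=-1,
            random_state=0,
            verbose=1
        )
        search.fit(X, y)

        # Save the results as properties of our object.
        self.best_estimator_ = search.best_estimator_
        self.best_pipeline = search.best_params_
```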

The “fit” method is finished.

We can now add the “predict” and “predict_proba” methods just like any other sklearn model and our MyAutoMLClassifier is complete.
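
They simply delegate to the best estimator found by the search:

```python
    def predict(self, X):
        return self.best_estimator_.predict(X)

    def predict_proba(self, X):
        return self.best_estimator_.predict_proba(X)
```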

We can now import our dataset, split it into training and test, create an instance of MyAutoMLClassifier and fit it.
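
For example, using the breast cancer dataset bundled with scikit-learn (the split parameters are my choice):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = MyAutoMLClassifier()
model.fit(X_train, y_train)
```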

That last line of code alone triggers all the heavy lifting of AutoML.

Once the model is fitted, we can calculate the balanced accuracy on the test set and see which model and pre-processing parameters have been selected:
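
For example:

```python
from sklearn.metrics import balanced_accuracy_score

print(balanced_accuracy_score(y_test, model.predict(X_test)))
print(model.best_pipeline)
```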

As you can see, we are using a Gradient Boosting Tree Classifier with all the features and a numerical cleaning strategy based on the median value. The learning rate of the model is 0.1 and the number of estimators is 125. This is the result of AutoML.

Conclusions

AutoML libraries are very useful tools for a Data Scientist and can actually save us a lot of time. Depending on the business context we work in, we may need to work only with certain models or cleaning procedures, so we need our own customized version of AutoML.

The example shown in this article can easily be adapted to a regression problem and can be extended with other models like neural networks.
